You may find yourself trying to generate some output, and the output is wrong, where “wrong” means “is not accepted by whatever program consumes that output.” It is often the case that you have a reference implementation, which I somewhat whimsically call an “oracle”, that is known to produce acceptable output. In that case, you can use that reference implementation to check your implementation.¹
Give your implementation and the reference implementation the same input, and see if they produce the same output. For binary formats, you can use a binary file viewer program. In a pinch, you can use the Windows comp.exe or fc.exe /b programs. The files will not be identical,² and the tool will tell you where they differ. Work backward from that file offset to your program to see why your program chose to write the wrong data at that point. (Maybe you forgot to open the output file in binary mode?)
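If neither tool is handy, the comparison itself is easy to sketch. The Python below is a minimal illustration, not a replacement for a real binary diff tool: the function name first_difference and the chunk size are my own choices, and it assumes both outputs are ordinary files on disk. It reports only the first offset at which the two files disagree, which is the point to start working backward from.

```python
def first_difference(path_a, path_b, chunk_size=65536):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a strict prefix of the other, the returned offset is
    the length of the shorter file.
    """
    offset = 0
    # Open both files in binary mode so no newline translation occurs.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                # Locate the first mismatching byte inside this chunk.
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        return offset + i
                # No mismatch in the common part: the lengths differ.
                return offset + min(len(a), len(b))
            if not a:
                # Both files ended at the same point with no mismatch.
                return None
            offset += len(a)
```

For example, if the oracle wrote the bytes 61 0A 62 and your program wrote 61 0D 0A 62, the function returns 1, and the stray 0D at that offset points straight at a text-mode output file.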
¹ I used a variation of this technique some time ago when I reverse-engineered the calling conventions for various historical CPU architectures.
² If the file contents are identical at the byte level, yet one is accepted and the other is rejected, then the problem is not in the file contents. Maybe there is file metadata that does not match, like the Mark of the Web.
This seems like the most sensible approach. In the end, everything is simply a bunch of bytes.
I’m somewhat surprised when developers struggle with this. The file looks the same but doesn’t work? Whip out the hex viewer.
The simple cases are the Unicode BOM or line break differences (e.g. some programs still struggle with CRLF line breaks today).
More fun pops up with binary protocols, BCD encoded fields, dynamic lengths with length bytes in various formats. Bit it baby 🙂
Yeah, some years back, I was replacing a legacy code generation tool, the unmaintainability of which was increasingly holding back work in that area. And it produced *horrible* code, but the first step of the replacement was to get a new tool which could create exactly the same horrible code… as close to byte-for-byte identical as I could make it. Only once the new tool was producing the same code as the “evil oracle” could I start revising the templates to improve on what was coming out.
This hits my competitive programming experience hard. We would write a brute-force, easily correct program, then write (I think from scratch) another using a better data structure / algorithm, and generate some random test data to check for bugs in the better program. (This procedure gets a special name but I don't know the word/phrase in English.) Also, "Maybe you forgot to open the output file in binary mode?" was one frequent reason because in "ancient" times, we programmed on Windows, the standard answer was produced on Linux, and some judgers were byte-by-byte (most ignored line-trailing spaces including CR as...
Or the filename. VLC, of all things (a French media player), would in older versions refuse to play files with Nordic characters in the name. And not just refuse, but give an error message that looked like there was an error in the file. Fixed years ago.